ABSTRACT

Classical statistical methods and the small
enrollments in graduate departments have constrained the Graduate
Record Examinations (GRE) Validity Study Service to providing only 
validities for single predictors. Estimates of the validity of two or 
more predictors, used jointly, are considered too unreliable because 
the corresponding prediction equations often possess implausible 
characteristics. This study investigates two statistical 
methods, empirical Bayes and cluster analysis, to determine their
applicability to these validity problems. Data on 6,946 students from 
190 participating departments were used. It is concluded that, by 
using the new class of empirical Bayes methods, it is possible to 
obtain, at the department level, useful and reliable estimates of the 
joint validity of several predictors of academic performance. Further 
methodological refinement will allow the question of differential 
predictive validity to be addressed at the departmental level. The 
technical appendix describes the estimation problem following from 
the empirical Bayes model. (BS)

Executive Summary

Graduate education in the United States is characterized by an
enormous diversity of disciplines and the predominance of relatively
small enrollments in individual departments. In this setting, a
validity study based on a single department's data and employing
classical statistical methods can be of only limited utility and
applicability. At present, to participate in the Graduate Record
Examinations Validity Study Service, a department must have at least
25 students in its entering class. Only validities for single
predictors are provided; estimates of the validity of two or more
predictors, used jointly, are considered too unreliable because the
corresponding prediction equations often possess implausible
characteristics, such as negative coefficients. These constraints were
introduced by the Validity Study Service to reduce the chance that
the results in the report to a department would be overly influenced
by statistical artifacts in the data and hence serve more to mislead
than to inform. Two unfortunate consequences, however, are that
fewer departments than before can benefit from the service and those
that do cannot obtain information on a number of issues of interest.
For example, questions of the incremental contribution to validity
of one predictor when two or more predictors are already included in
a prediction formula cannot be answered. Moreover, even the single
predictor validities display considerable year-to-year fluctuation,
which may bewilder departments that regularly receive Validity Study
Service reports.

Until recently, little could be done to ameliorate this unsatisfactory
state of affairs. Although a number of strategies for pooling
data across departments have been proposed, they all suffer from at
least one drawback: Either the pooling is so indiscriminate that
the results appear to have questionable relevance to an individual
department or it is so delicate that it would appear to be difficult
to implement in an operational setting.

The goal of the present study was to investigate two statistical
methods, empirical Bayes and cluster analysis, to determine whether
their application to the problems faced by the Validity Study Service
could result in useful improvements. Because of the successful
application of empirical Bayes methods to validity problems in another
context, particular emphasis was placed on this approach. In fact,
considerable effort was expended in developing and studying a new
and more general class of empirical Bayes models that can accommodate
the complex structure of the Validity Study Service data base.

To borrow a term from sociology, empirical Bayes facilitates a
very general form of "contextual analysis" of the validity problem.
Essentially, the relation between the criterion and a constellation
of predictors within a given department is examined in the setting
of a large collection of departments. Of particular interest is any
evidence that the nature of this relation varies in association with
some measured characteristic(s) of the departments. An example
might be the finding that the inclination of the regression plane
increases as the department size increases. To the extent that such
pandepartmental findings are valid, the precision of the estimation
carried out in any one department can be improved by drawing upon
the information provided by the other departments.

Two technical points must be satisfactorily addressed before
such a procedure can be implemented. The first is to determine how
much the estimates based on a single department's data should be
modified by data from other departments. The empirical Bayes
methodology provides a good solution that depends on both the
precision of the within-department estimate and the apparent strength
of the pandepartmental relation. Details are given in the text.
The second point concerns which departments are to be considered
together. Various possibilities exist, including the formation of 
clusters of departments on the basis of either substantive or 
statistical criteria. The empirical Bayes methodology would then be 
applied separately to each cluster. An extreme approach is not to 
form clusters at all but to treat all graduate departments partici- 
pating in the Validity Study Service as a single "family," trusting 
that certain measured departmental attributes are sufficient to 
characterize the departments in the validity setting. 

Our analysis demonstrates that empirical Bayes does provide a
useful way of combining information across departments. Interestingly,
it appears to work best, in this case, when departments are
characterized by various measures of student quality but are not divided into
clusters determined by discipline, location on a verbal-quantitative
axis, or various other statistical criteria. The practical result
is that, even for departments with as few as 10 students, separate
prediction equations have been obtained from which stable estimates
of the joint validity of two or more predictors can be derived.

An important component of the study was the comparison of various
suggested approaches along a number of dimensions, especially those
related to the quality of their predictions. The quantitative
nature of these comparisons may seem somewhat at odds with the way
in which most graduate departments probably utilize the results of a
validity study. Rather than using the prediction equations provided 
to make exact predictions of first-year averages to be earned by 
prospective candidates, they look for guidance on the relative 
weights to be assigned to various predictors in making a qualitative 
assessment of the candidates. However, in our view, there can be 
little justification in proposing that empirical Bayes methods 
supplant classical least squares unless it can be shown that, among 
other things, the change would result in prediction equations that 
yield better predictions and are more stable through time.

Among the approaches compared were ordinary least squares, 
least squares in conjunction with pooling of data, and a variety of 
empirical Bayes methods in conjunction with different levels of 
clustering of departments. To simulate the admissions setting,
most of the comparisons were carried out through cross-validation:
Each department was randomly divided in half; models were estimated
with one half-sample, the calibration sample, and tested on the
other half, the validation sample. The comparisons demonstrated
that a fairly simple empirical Bayes model not only yielded better
predictions of first-year averages but also facilitated the accurate
assessment, a priori, of the quality of these predictions. Moreover,
the prediction equations were quite stable and rarely displayed
implausible features such as negative weights.
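
The half-sample design described above can be sketched in a few lines. This is an illustration only, not the code used in the study: the function names, the single-predictor no-intercept model, and the simulated department are all assumptions made for the example.

```python
import random

def split_half(records, rng):
    """Randomly divide one department's records into two halves."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def fit_slope(records):
    """Least squares slope for the illustrative model y = b * x."""
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return sxy / sxx

def validation_mse(slope, records):
    """Mean squared prediction error of a fitted slope on held-out records."""
    return sum((y - slope * x) ** 2 for x, y in records) / len(records)

rng = random.Random(0)
# Hypothetical department: 20 students as (predictor, criterion) pairs.
dept = [(x, 0.5 * x + rng.gauss(0, 1))
        for x in (rng.gauss(3, 0.5) for _ in range(20))]

calibration, validation = split_half(dept, rng)
slope = fit_slope(calibration)          # estimated on the calibration half
mse = validation_mse(slope, validation)  # judged on the validation half
```

Competing estimation methods would each be fitted on the calibration half and ranked by their error on the validation half.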

How do our results compare with those reported by Livingston
and Turner (1982)? Although most of their report centers on zero-order
correlations, they do report joint validities for verbal and
quantitative scores and undergraduate grade point average, estimated either
by pooling data across disciplines or by forcing the predictors to
be equally weighted. These validities fall in the range of 0.23 to
0.43 and are somewhat lower than the validities that can be realized
using prediction equations obtained through empirical Bayes. These
range from 0.30 to 0.55.

In another phase of the research program, a new statistical
method for hierarchical clustering was developed and applied to
applicants' characteristics. These characteristics include
performance on the GRE General Test and the Subject Test taken.
Both the resulting clusters and the original candidate information
were employed in a number of ways in estimating validity, but the
results did not offer an improvement over the empirical Bayes model.
Nonetheless, the new clustering of disciplines may be of some
interest in its own right.

The empirical Bayes method has also been adapted to the problem
of providing separate estimates of validity for various subgroups of
a given population of students. Two trials were carried out: one
in which students were categorized by race and another in which they
were categorized by both age and sex. In both cases, the differences
in validity were neither of practical importance nor statistically
significant. However, the relatively small amount of data on
minorities suggests that real differences, if they existed, would be
difficult to uncover.

The principal methodological conclusions of this study are
that, through the use of a new class of empirical Bayes methods, it
is possible to obtain, at the departmental level, useful and reliable
estimates of the joint validity of several predictors of academic
performance and that these methods may be further refined to address
the question of differential predictive validity, again at the
departmental level. These results have important practical
implications for the GRE Validity Study Service.

1. Introduction

The research described in this report was undertaken both to
develop solutions to some basic problems that have plagued validity
studies of Graduate Record Examinations (GRE) scores and to broaden
the range of questions the GRE Validity Study Service might usefully
address.

To examine the validity of a single measure as a predictor
of first-year average (FYA) in a graduate department, the standard
approach has been to carry out a simple regression of the criterion,
FYA, on the measure, based on data culled from a recent cohort of
students successfully completing their first year. Similarly, a
multiple regression must be performed to assess the joint validity
of two or more predictors. In either case, the appropriate validity
(or correlation) coefficients may be easily derived from the fitted
regressions.

Unfortunately, these validity estimates are affected by the
very nature of the process by which students are admitted to graduate
school. In particular, if the predictors whose validity is to be
assessed are employed in the selection process, the attending students 
will display a distribution of test score values that differs from 
that of the applicants. For example, the proportion of attending 
students with low test scores will usually be smaller than that of 
applicants. The corresponding validity estimates will tend to be 
lower than they would otherwise be in an unselected population. 
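
A small simulation can make this attenuation concrete. The sketch below is not drawn from the GRE data: the applicant pool, the strength of the score-performance relation, and the admission rule (admit everyone above the pool mean) are all assumed purely for illustration.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)

# Hypothetical applicant pool: standardized test score and later performance.
applicants = []
for _ in range(20000):
    score = rng.gauss(0, 1)
    performance = 0.5 * score + rng.gauss(0, 1)
    applicants.append((score, performance))

# Assumed selection rule: only applicants scoring above the pool mean attend.
admitted = [(s, p) for s, p in applicants if s > 0]

r_applicants = pearson(*zip(*applicants))  # validity in the unselected pool
r_admitted = pearson(*zip(*admitted))      # validity among attending students
```

Under these assumptions the correlation among admitted students comes out noticeably smaller than in the full pool, even though the underlying relation is identical in both groups.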

To this difficulty, encountered in virtually all validity
studies, must be added the small size of most graduate departments.
Small sample sizes result in unstable estimates of validity, that
is, estimates that may fluctuate wildly from year to year. In
addition, the selection process will also depend on a number of
unmeasured or unrecorded factors (letters of recommendation,
extracurricular activities, and the like) that are not perfectly
correlated with those available for study. Fluctuations in the
quality of applicants on these unavailable factors across years
will also contribute to the apparent instability of the fitted
regressions.

Consequently, estimates of validity derived under these
circumstances may be of limited utility. The present Validity Study
Service has taken a number of steps to mitigate the effects of these
difficulties. At one time, departments with as few as 10 students
could participate in the service. Now they must have at least 25
students. Secondly, only the zero-order correlations are reported.
These are the validities of each factor taken alone. The drawback
to the first step is that many fewer departments may avail
themselves of the service. The drawback to the second is that the joint
validity of a set of predictors cannot usually be accurately inferred
from their individual validity coefficients. Moreover, the zero-order
correlations may be quite misleading if two or more of the factors
were important elements in the selection process.

Nonetheless, the course adopted by the Validity Study Service
is a wise one inasmuch as Boldt (1975) has reported that attempts to
fit multiple regressions to GRE data result in numerous negative
coefficients for the predictors, implying that the better the
performance on the predictor, the poorer the performance in graduate
school! Such results are undoubtedly artifacts of the selection
process and small sample sizes described above. A number of
researchers have wrestled with these problems. Generally, their
strategy has been to devise sensible ways of combining information
across departments.

One of the first applications of Bayesian techniques to these 
problems was reported by Novick, Jackson, Thayer, and Cole (1972).
Building on a series of papers by Lindley (1969, 1970), they carried
out a cross-validation study on a set of American College Testing
Program data. The results suggested that the Bayesian approach
outperformed ordinary least squares on a within-department basis.
An excellent account of related statistical work up to 1973 is 
provided by Boldt (1975). 

Boldt compared a central prediction approach, which applies 
a least squares technique to College Board Validity Study Service 
data appropriately pooled from different sources, to the Bayes 
approach of Novick et al. He reported that least squares performed 
about as well as Bayes in a cross-validation analysis of prediction
performance. Wilson (1979) used a central prediction system of
common weights for departments and suggested methods for testing
whether an individual department had a different prediction system.
Data were not available for cross-validation of this method. Using
GRE validity data, Wilson found by analysis of variance that only a
small fraction of departments appeared to have different prediction
systems.

Using Law School Admission Test validity data, Rubin (1980) has
reported that in a cross-validation of multiple R, empirical
Bayes techniques outperformed the ordinary least squares
(within-school) techniques. One of the significant results was that,
although each individual law school data set was large, the least
squares prediction systems were shown to be unstable from year to
year when compared to the empirical Bayes prediction systems. The
policy of some departments in allowing high scores on one test to
compensate for low scores on another, studied by Dawes (1975), has
been shown to severely hamper estimation of the underlying prediction
system when using a single department's data and is one cause of the
instability mentioned above.

Braun and Jones (1981) have studied prediction bias in the 
context of the Graduate Management Admission Test for Blacks in a
data set in which only 6 percent of the students were Black. Using
empirical Bayes techniques based on Dempster, Rubin, and Tsutakawa
(1981), they obtained prediction systems together with appropriate
confidence statements for Blacks even in schools having fewer than
five Black students. Similar techniques should be applicable to the
data of the GRE Validity Study Service.

A requirement of the methods described above has been the
identification of homogeneous groups of departments. Because of the diversity
of graduate department types, attempts at clustering have resulted
either in a large number of small groups or a small number of rather
heterogeneous groups. In either case, the decision is often made in
a subjective fashion. Our present approach has been to extend the
empirical Bayes methodology to accommodate the structure of the GRE
Validity Study Service data base with a view to possibly dispensing
with the need for such prior cluster formation. We have also
considered empirical Bayes prediction systems that do utilize
departmental clusters in an important way. Nonetheless, we believe
that the analysis described below conclusively demonstrates that a
general empirical Bayes framework is sufficient to produce a superior
prediction system without recourse to any subjective grouping of
departments.

A brief description of the data base we have worked with is 
provided in the next section. In Section 3, our empirical Bayes 
models are explained and compared with current procedures. The 
application of empirical Bayes methods to the problem of determining 
the predictive validity of GRE scores for various subgroups of 
students is addressed in Section 4. Section 5 considers alternative
approaches to the clustering of departments, and the final section
contains some discussion of the validity of the GRE score battery,
based on inferences from the preferred model, as well as suggestions
for future work.

2. Data

The information collected from participating departments by
the GRE Validity Study Service from its inception through the spring
of 1981 provided the essential data for this study. The information
arrived in two ways. A "Prerecorded Data Collection Form" is sent
by the service to each graduate department, containing the names of
students who had test scores sent to the department, which then
supplies for each student the year of enrollment, undergraduate
grade point average (UGPA), and graduate first-year average (FYA).
Using an "Add-On Data Collection Form," the department may supply
information on additional students that enrolled but were not on the
first list. The Validity Study Service then matches the two lists
against the original score sender files, producing a single file
containing score information as well as sex, ethnic status, native
language, and handicap status for each student.

The original file of 8,224 students was reduced to 6,946 
students from some 190 departments because of the requirement of
full information on the predictor scores and criterion. In
subsequent sections, when we execute cross-validation procedures or
carry out analyses of differential predictive validity studies for
various subgroups, small sample sizes or missing information on
individual characteristics will result in further reductions in the
data base employed. These reductions will be described in the
appropriate sections. 

In our work on clustering, we employed the "Graduate Institution
Summary Statistics Reports" issued by the GRE program to all
institutions who received GRE scores for a given testing year.
These reports summarize for each departmental type the distributions
of GRE verbal and quantitative scores, the GRE Subject Tests taken,
and the Subject Test scores of all applicants to departments of that
type. The summary data are based on all scores reported between
October 1978 and October 1979.

The data were scaled prior to carrying out any computations.
The GRE scores were divided by 200 to make them roughly comparable
to undergraduate grade point average, and the first-year averages
were scaled separately for each department by subtracting out the
mean and dividing by the standard deviation. Because the standard
deviation can be quite variable, especially in small samples, this
scaling introduces an extra measure of noise. On the other hand,
the assumption of exchangeability of the regression coefficients is
more plausible when the criterion scores have been standardized in
this manner, and the empirical Bayes estimation procedure mitigates
most of the effects of the added noise.
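
The scaling step can be sketched as follows. The record layout (the dictionary keys), the example scores, and the use of the sample standard deviation are assumptions of this sketch; the report does not say which divisor was used.

```python
import statistics

def scale_department(students):
    """Rescale one department's records.

    students: list of dicts with hypothetical keys
    'verbal', 'quant', 'ugpa', and 'fya'.
    GRE scores are divided by 200; FYA is standardized within the department.
    """
    fya = [s["fya"] for s in students]
    mean = statistics.fmean(fya)
    sd = statistics.stdev(fya)  # sample standard deviation (assumed divisor)
    return [
        {
            "verbal": s["verbal"] / 200,
            "quant": s["quant"] / 200,
            "ugpa": s["ugpa"],
            "fya": (s["fya"] - mean) / sd,
        }
        for s in students
    ]

# Hypothetical three-student department.
dept = [
    {"verbal": 600, "quant": 700, "ugpa": 3.4, "fya": 3.2},
    {"verbal": 500, "quant": 640, "ugpa": 3.1, "fya": 3.6},
    {"verbal": 720, "quant": 580, "ugpa": 3.8, "fya": 3.9},
]
scaled = scale_department(dept)
```

With so few students the within-department standard deviation is itself poorly estimated, which is the extra noise the text refers to.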
3. Empirical Bayes Models for Prediction
3.1 Introduction

The nature of graduate education in the United States, most
notably the diversity of departments and the relatively small
enrollments, has forced researchers to adopt a number of compromises
in their analysis of the validity of the GRE score battery and
college grades. For example, departments with fewer than 10 (or,
more recently, 25) students have been excluded from studies carried
out by the GRE Validity Study Service. Moreover, only zero-order
correlations have been presented since it appears that the estimates
of coefficients in a multiple regression are too unstable.

Another approach has been to pool data across a group of
departments and to estimate a single prediction equation for the ensemble.
These groups could be formed on a quasi-subjective basis (for
example, all psychology departments) or on a statistical basis by
testing for the equality of regressions (Wilson, 1979). Whatever
the method, a number of problems have persisted, including a large
proportion of negative coefficients and serious instability of
estimated coefficients.

Our own efforts have been centered on applying empirical 
Bayes methods to the question. In a sense, empirical Bayes provides 
an attractive compromise between the extremes of locally determined
regressions for each participating department and a common regression
for a group of departments. In fact, empirical Bayes does require
that a group of departments of like nature be identified. Then,
under certain assumptions (specified below), separate estimates for
each department are developed. These estimates, however, are
combinations of the least squares estimates based on that
department's data alone and a pooled estimate based on the data from all
schools in the group. The exact nature of the combination is
determined by the estimated precision of the least squares estimate
and the pooled estimate. Stability of the final individual estimates
is enhanced by the contributions of the pooled estimates, while
departmental idiosyncrasies are given some weight through the least
squares estimates.

3.2 Developing Empirical Bayes Models

While the analysis to be described in succeeding sections is
fairly complex, the essential notions of the empirical Bayes
methodology can be illustrated by a simple example.

Suppose we wish to study by linear regression methods the
relation between a criterion Y and a single predictor X in a
collection of institutions, m in number. Thus, we postulate a simple
model (with no intercept) of the form

(1)  $Y_{ij} = \beta_i X_{ij} + e_{ij}, \quad i = 1, \ldots, m$

where

$Y_{ij}$ = criterion score of student j in institution i
$X_{ij}$ = predictor score of student j in institution i
$e_{ij}$ = associated residual error
$\beta_i$ = slope of regression in institution i.

In classical statistics, the parameters $\beta_i$ are estimated on the basis
of the data in institution i only. The particular statistics employed
in the estimation depend on the assumptions made concerning the
distribution of the residual errors $e_{ij}$.

In the empirical Bayes approach, another assumption is made:
namely, that the true slopes $\beta_i$ may be thought of as being randomly
generated from some prior distribution. Although the form is
specified beforehand, the parameters (called hyperparameters) of
this distribution are left free to be estimated from the data.
These hyperparameters, in turn, are used to provide estimates of $\beta_i$
that differ from those of a classical analysis. For example, we
might assume

(2)  $\beta_i \sim N(\mu^*, \Sigma^*)$

where $\mu^*$ and $\Sigma^*$ are the free scalar hyperparameters that must be
estimated from the data.

The key implication of the empirical Bayes assumption is that 
the data in one set of institutions contain information about the 
value of the slope in each individual institution. Thus, a mechanism 
is provided whereby the information in the entire data set can be 
employed in the estimation of the individual slope parameter. The 
practical consequence, in the present context, is that the empirical
Bayes estimate is derived by pulling the individual estimate towards
the estimate derived by pooling the data over institutions. The 
greater the estimated imprecision in the individual estimate, the 
more it is modified. 
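
A minimal numerical sketch of this pulling, under model (1) and assumption (2), is given below. It is not the estimation procedure of the report (that is described in the technical appendix): here the hyperparameters are estimated by simple moments, the residual variance is treated as known, and the data are simulated, all of which are assumptions of the sketch.

```python
import random
import statistics

def ls_slope_and_variance(records, sigma2):
    """No-intercept least squares slope and its sampling variance."""
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return sxy / sxx, sigma2 / sxx

def shrink_to_point(slopes, variances):
    """Pull each slope toward a common value; noisier slopes move further.

    Hyperparameters are estimated by simple moments (an assumption of
    this sketch, not the report's estimation method).
    """
    mu = statistics.fmean(slopes)
    tau2 = max(0.0, statistics.variance(slopes) - statistics.fmean(variances))
    shrunk = []
    for b, v in zip(slopes, variances):
        weight = tau2 / (tau2 + v)  # weight on the institution's own estimate
        shrunk.append(weight * b + (1 - weight) * mu)
    return mu, tau2, shrunk

rng = random.Random(2)
sigma2 = 1.0  # residual variance, taken as known for simplicity

# Hypothetical institutions of varying size with slopes scattered around 0.5.
institutions = []
for _ in range(30):
    true_slope = rng.gauss(0.5, 0.1)
    n = rng.choice([10, 25, 100])
    institutions.append([(x, true_slope * x + rng.gauss(0, 1))
                         for x in (rng.gauss(3, 0.5) for _ in range(n))])

estimates = [ls_slope_and_variance(recs, sigma2) for recs in institutions]
slopes = [b for b, _ in estimates]
variances = [v for _, v in estimates]
mu, tau2, shrunk = shrink_to_point(slopes, variances)
```

Each shrunken value is a weighted average of the institution's own slope and the common value, with the weight on the institution's own slope falling as its sampling variance rises.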

In mathematical terminology, the proper application of empirical 
Bayes methods depends on the assumption that the units of the analysis
(in this case, the departmental regression coefficients) constitute
an exchangeable group of units. Assumption (2) is one very special
way of formulating this assumption. Essentially it means that there
is no reason to believe, a priori, that the values of the regression
coefficients for one department are larger or smaller than those for
another department.

Most researchers, we believe, would be quite comfortable with
the assumption of exchangeability for a group of like departments,
such as a group of anthropology departments, known to be equally
competitive and with similar grading standards. However, if we
wished to treat simultaneously a more heterogeneous group of
departments, such as all physical science departments, exchangeability or,
in particular, assumption (2) would no longer appear very plausible.
Nevertheless, by generalizing (2), it is possible to remain in the
empirical Bayes framework.

For example, imagine that a set of the regression coefficients
is not exchangeable because their values vary systematically with
some other departmental factor, Z. That is, suppose that in place
of (2), we have

(3)  $\beta_i = Z_i G + D_i$

where G is an unknown coefficient, and $D_i$ is a random error for which
it is assumed that

(4)  $D_i \sim N(0, \Sigma^*)$.

Just as in ordinary multiple regression, the coefficient G determines
how the departmental covariate must be weighted to yield the
regression coefficient $\beta_i$. Thus the only novelty is the nature of the
criterion.

A convenient choice for $Z_i$ is $\bar{X}_i$, the mean score on the
predictor X among students in the department. Note that we are assuming

have retained exchangeability in the model by explicitly modeling
the systematic (nonexchangeable) part of the distribution of the
regression coefficients. In principle, this model can be tested by
embedding it in a still more complex system.

When the model (1), (3), and (4) is employed, the higher order
parameters G and $\Sigma^*$ must be estimated. The least squares estimate
of $\beta_i$ is pulled toward the point $Z_i G$ on the line ZG to yield the
empirical Bayes estimate of $\beta_i$.

Figure 1 displays the difference between the consequences of
assuming (1) and (2) or (1), (3), and (4). Under (1) and (2), the
estimated values are pulled toward a common point, while under (3) and
(4) they are pulled toward the line.
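
Shrinkage toward the line can be sketched in the same spirit. Everything numerical here is hypothetical: the five slopes, their sampling variances, and the covariate values are invented, and G and the prior variance are estimated by unweighted moments rather than by the procedure used in the report.

```python
def shrink_to_line(slopes, variances, covariates):
    """Pull each least squares slope toward the point Z_i * G on the line."""
    # Regression through the origin of the slopes on the covariate, as in (3).
    g = (sum(z * b for z, b in zip(covariates, slopes))
         / sum(z * z for z in covariates))
    residuals = [b - z * g for z, b in zip(covariates, slopes)]
    resid_var = sum(r * r for r in residuals) / (len(slopes) - 1)
    # Moment estimate of the prior variance, truncated at zero.
    tau2 = max(0.0, resid_var - sum(variances) / len(variances))
    shrunk = []
    for b, v, z in zip(slopes, variances, covariates):
        weight = tau2 / (tau2 + v)  # weight on the department's own slope
        shrunk.append(weight * b + (1 - weight) * z * g)
    return g, tau2, shrunk

# Hypothetical inputs for five departments.
slopes = [0.20, 0.35, 0.30, 0.55, 0.45]           # least squares slopes
variances = [0.004, 0.001, 0.006, 0.002, 0.005]   # their sampling variances
covariates = [2.4, 2.8, 3.0, 3.4, 3.6]            # mean predictor scores Z_i

g, tau2, shrunk = shrink_to_line(slopes, variances, covariates)
targets = [z * g for z in covariates]  # points on the line for each department
```

The only change from shrinking to a point is the target: each department is pulled toward its own position on the fitted line rather than toward one shared value.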

Of course, (3) can be modified to include more covariate
information (that is, a vector of departmental factors), as well as a
vector of $\beta$s rather than a single coefficient. As the heterogeneity
of the departments within a proposed cluster increases, presumably
more relevant covariate information must be captured in order to
preserve the applicability of the empirical Bayes methodology.

From the point of view of validity assessment, the necessity
of constructing clusters of departments and carrying out separate
analyses for each is something of a nuisance. One of the goals
of the present study is to determine the extent to which one can
dispense with clusters by incorporating sufficient covariate
information into the model through assumption (3). A clear
limitation is the amount of such information available. In this setting,
we are constrained essentially to covariates derived from the
predictor scores of students in the department. In other
applications, covariates in (3) need not bear any relation to the predictors
appearing in (1).

Figure 1: Effects of Empirical Bayes Estimation
(Illustrative)

3.3 Descriptions of Models

We will be concerned with the estimation of prediction equations
for individual graduate departments. For department i, these
equations take the form

(5)  $Y_{ij} = \beta_{i0} + \beta_{i1} V_{ij} + \beta_{i2} Q_{ij} + \beta_{i3} U_{ij} + e_{ij}$

where j indexes students within departments. The three predictors
are GRE verbal score ($V$), GRE quantitative score ($Q$), and UGPA ($U$).
In what follows,
the verbal and quantitative scores have been rescaled by dividing by
200 so that their regression coefficients should be of comparable
magnitude to that of UGPA, which is on a 0-4 scale. As usual, the
random errors are assumed to be independently normally distributed,
and interest centers on the estimation of the vector of parameters
$\beta_i = (\beta_{i0}, \beta_{i1}, \beta_{i2}, \beta_{i3})$.
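
For reference, the single-department least squares fit that serves as the baseline can be sketched as below. The small linear solver is written out only to keep the example self-contained; the function names, the six hypothetical students, and the noise-free criterion values are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_department(rows):
    """Least squares fit of FYA on an intercept, verbal, quantitative, UGPA.

    rows: list of (verbal, quant, ugpa, fya) tuples, already rescaled.
    Returns the four coefficients [b0, b1, b2, b3].
    """
    design = [[1.0, v, q, u] for v, q, u, _ in rows]
    y = [r[3] for r in rows]
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(4)]
           for i in range(4)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(4)]
    return solve(xtx, xty)  # normal equations X'X b = X'y

# Hypothetical noise-free department, so the fit recovers the coefficients.
true_beta = [-2.0, 0.3, 0.2, 0.4]
rows = []
for v, q, u in [(3.0, 3.5, 3.2), (2.5, 3.2, 3.6), (3.6, 2.9, 3.0),
                (2.8, 3.8, 3.9), (3.3, 3.1, 2.7), (2.2, 2.6, 3.4)]:
    fya = true_beta[0] + true_beta[1] * v + true_beta[2] * q + true_beta[3] * u
    rows.append((v, q, u, fya))

beta = fit_department(rows)
```

With real data and only a handful of students per department, four coefficients estimated this way are highly variable, which is the instability the empirical Bayes models are meant to reduce.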

The method in current use, least squares based on data from 
a single department, will be the standard against which the new 
procedures will be judged. The various empirical Bayes procedures 
we propose will differ along two basic dimensions: (1) whether the 
full family of departments or only clusters of departments are
treated simultaneously, and (2) the geometrical structure of the
shrinking, that is, to a point, plane, or quadratic surface. In
discussing the first point, we shall generally employ the same set
of five clusters of departments: humanities, social sciences
(excluding psychology), psychology, biological sciences, and physical
sciences. Table 1 presents the total number of departments belonging
to each cluster and participating in the Validity Study Service, as
well as the total number of students in each cluster.

To enhance readability, we will employ the mnemonic EBxx
to designate the various empirical Bayes models. The third symbol
will be F or C depending on whether the full family or clusters of
departments are used. The fourth symbol will be p, f, q, or q*,
depending on whether the least squares estimates are pulled toward a
point, a plane (flat surface), or one of two quadratic surfaces.
Thus, for example, EBFf denotes the approach involving the full
family of departments and shrinking toward a plane.

While the clustering feature of these models is easy to
describe, the geometrical nature of the shrinking is more difficult.
The general structure takes the form

(6)  $\beta_i = Z_i G + D_i$

where

(7)  $D_i \sim N(0, \Sigma^*)$.

It remains to specify the covariates included in $Z_i$, which in turn
determine the dimension of G, the matrix of unknown coefficients
to be estimated. Note that, given the level of clustering, equations
(5), (6), and (7) determine the model.
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TABLE 1 



Niraibers of D^partnents and Nurabers of Students by Cluster: Total and Half-Sasiples 

Total Half-Soaple 1 Half-Saaple 2 

# Depts. # Students # Depts. # Students # Depts. # Students 



190 



6,946 



142 



3.172 



142 



3.16b 



Biological 
Sc iences 

Humani ties 

Phys ical 
Sciences 



25 



27 



46 



Psychology 23 
Social Sciences 69 



916 

753 
1,379 

3.010 



23 

19 

25 

21 
54 



445 

330 
548 

434 
1,415 



23 

19 

25 

21 
54 



440 

335 
550 

429 
1,414 
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We nov present the different choices for 
EBCp, EBFp: $Z_i$ is a constant. Thus $Z_i$ is identically one and the
            model reduces to the form

                $B_i = G + \delta_i$, with $\delta_i \sim N(0, \Sigma^*)$ as in (7).

            Recall that for EBCp, $\Sigma^*$ is estimated separately for
            each cluster.

EBCf, EBFf: $Z_i$ is a four-component vector of departmental covariates:
            a constant ($Z_{i0}$), mean verbal score ($Z_{i1}$), mean
            quantitative score ($Z_{i2}$), and mean UGPA ($Z_{i3}$). (Means
            are taken over all students in the department.) Correspondingly,
            G is a 4x4 matrix of unknown parameters
            and $\Sigma^*$ is a 4x4 covariance matrix that is also unknown.
            For EBCf, G is estimated separately for each cluster
            but, for reasons of parsimony, a common estimate of $\Sigma^*$
            across clusters is employed.

EBFq:       $Z_i$ is a seven-component vector including $Z_{i0}$, $Z_{i1}$, $Z_{i2}$,
            $Z_{i3}$, $Z_{i1}^2$, $Z_{i2}^2$, $Z_{i3}^2$. Such a model allows us to explore
            the possibility of a nonlinear relation between the
            regression coefficients of interest and the departmental
            covariates. The corresponding EBCq model is not considered
            because of the number of coefficients to be
            estimated.

EBFq':      $Z_i$ in EBFq is augmented by three interaction terms:
            $Z_{i1} Z_{i2}$, $Z_{i1} Z_{i3}$, $Z_{i2} Z_{i3}$.
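
The four geometries differ only in which departmental covariates enter $Z_i$. The following is a minimal sketch of how such vectors might be assembled; the function name `covariates`, the argument names, and the numerical inputs are illustrative assumptions, not part of the original analysis, and the scale on which the mean scores enter is left to the user.

```python
import numpy as np

def covariates(mean_v, mean_q, mean_u, kind):
    """Return a departmental covariate vector Z_i for one shrinking geometry.

    kind: 'p' (point), 'f' (plane), 'q' (pure quadratic surface),
          "q'" (quadratic surface with cross-products).
    The three means are departmental mean GRE-V, GRE-Q, and UGPA, on
    whatever scale is used for the covariates (an assumption here).
    """
    linear = [mean_v, mean_q, mean_u]
    squares = [x * x for x in linear]
    cross = [mean_v * mean_q, mean_v * mean_u, mean_q * mean_u]
    if kind == "p":
        terms = [1.0]
    elif kind == "f":
        terms = [1.0] + linear
    elif kind == "q":
        terms = [1.0] + linear + squares
    elif kind == "q'":
        terms = [1.0] + linear + squares + cross
    else:
        raise ValueError(f"unknown geometry: {kind}")
    return np.array(terms)

# Hypothetical department; the numbers are purely illustrative.
for kind in ("p", "f", "q", "q'"):
    print(kind, len(covariates(2.75, 3.0, 3.2, kind)))
```

The lengths 1, 4, 7, and 10 correspond to the number of rows of G under each geometry, which is why the q and q' models require many more hyperparameters than the plane.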
It should be noted that EBCp is really a special case of EBCf
with a single covariate, the constant. In its full generality,
assumption (6) implies that the vectors of coefficients in the
prediction equation (5) are not normally distributed about a point
in 4-space, but rather are normally distributed about a plane in
8-space, with the typical (average) set of coefficients being linearly
related to the level of achievement of the average student in the
department, as measured by GRE verbal and quantitative scores and
UGPA. The nature of this linear relationship is determined by the
value of G.

How closely the apparent values of $B_i$ fall to the plane $Z_i'G$ is
roughly indicated by the size of the diagonal elements of $\Sigma^*$. All
other things being equal, a smaller estimated $\Sigma^*$ results in the
least squares estimates being pulled more strongly toward the plane.

We did not expect EBCp to work well, given the heterogeneity
of departmental types within four of the clusters. Rather, it was
introduced as a benchmark against which the performance of the other
empirical Bayes procedures could be measured. On the other hand,
EBCf seemed to possess sufficient flexibility to offer some promise
of reasonable performance.

Models EBFq and EBFq' postulate that the vectors of coefficients
$B_i$ are generated from a quadratic surface rather than a plane. That
is, these models permit a nonlinear relation between the coefficients
and the covariates. Our plan has been to study the performance
characteristics of EBFq' and, if they proved favorable, to investigate
the possibility of progressively simplifying it, that is, to employ EBFq,
EBFf, or even EBFp. From the point of parsimony, we would prefer to
use a model from the EBF family provided its performance essentially
matches that of the EBC family. In Section 3.5, we will discuss
various measures of performance.

Before turning to that section, we will describe some features
of the parameter estimates. It should be noted that we employ
maximum likelihood estimation, using the EM algorithm (Dempster,
Laird, & Rubin, 1977) as a computational technique for obtaining
estimates of the hyperparameters G and $\Sigma^*$. As a by-product, the
posterior distributions of $B_i$, given the data and the maximum
likelihood estimates $\hat{G}$ and $\hat{\Sigma}^*$, are produced. The means of these
posterior distributions serve as the empirical Bayes estimates of
$B_i$. The essential details have already been published (Braun,
Jones, Rubin, & Thayer, 1983), while the Appendix contains additional
material relevant to the present application.
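
To make the idea of a posterior mean concrete, the sketch below computes the standard normal-normal posterior mean for one department. It is not the EM procedure of the report: it assumes the hyperparameters and the sampling covariance of the least squares estimate are already known, and every name and number in it (`eb_posterior_mean`, `V_ls`, the two-coefficient example) is hypothetical.

```python
import numpy as np

def eb_posterior_mean(b_ls, V_ls, prior_mean, Sigma_star):
    """Posterior mean of B_i under a normal prior and normal sampling error.

    b_ls:        least squares estimate of B_i
    V_ls:        sampling covariance of b_ls (treated as known: an assumption)
    prior_mean:  point on the fitted surface for this department
    Sigma_star:  between-department covariance matrix
    """
    P_ls = np.linalg.inv(V_ls)             # precision of the least squares estimate
    P_prior = np.linalg.inv(Sigma_star)    # precision of the prior
    post_cov = np.linalg.inv(P_ls + P_prior)
    # Precision-weighted average of the two sources of information.
    return post_cov @ (P_ls @ b_ls + P_prior @ prior_mean)

# Hypothetical two-coefficient illustration.
b_ls = np.array([1.0, 0.0])
prior_mean = np.array([0.0, 0.5])
V_ls = np.eye(2) * 4.0         # noisy least squares estimate
Sigma_star = np.eye(2) * 1.0   # tighter between-department spread
print(eb_posterior_mean(b_ls, V_ls, prior_mean, Sigma_star))
```

Because the least squares estimate here is four times as variable as the prior, the result lies much closer to `prior_mean` than to `b_ls`, which is the shrinking behavior described in this section.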
3.4 Model Description 

One of the drawbacks of the more complex empirical Bayes
models we have introduced is that they are difficult to visualize
and understand. Even the relatively simple EBFf requires a fair
amount of effort before we can draw some insight from its numerical
characteristics. To illustrate, in this section we will carry out
an analysis of EBFf and compare it to least squares.

Recall that for EBFf we assume that the vectors of regression
coefficients for the departments are themselves generated from a
linear regression on the mean test scores of the students in the
department. This multivariate linear regression is characterized by
a matrix of coefficients, denoted G, and a normal error structure
characterized by a covariance matrix, denoted $\Sigma^*$. Table 2 presents
the estimates $\hat{G}$ and $\hat{\Sigma}^*$ of these parameters for the GRE Validity
Study Service data described in Section 2.

TABLE 2

Numerical Characteristics of a Fitted Empirical Bayes Model (EBFf)

$\hat{G}$: Estimate of Matrix of Regression Coefficients at Hyperparameter Level

     1.60     1.20      .03     -.29
     -.66      .24      .06     -.15
     -.70     -.25      .34     -.01
     -.53     -.26     -.25      .41

$\hat{\Sigma}^*$: Estimate of Covariance Matrix at Hyperparameter Level (x 10^-2)

     9.74     1.01      .07    -3.85
               .60      .04     -.83
                        .17     -.20
                                2.03

$\hat{\Sigma}^*$ (correlation form)

     1         .42      .05     -.86
              1         .13     -.75
                       1        -.34
                                1

One interpretation of $\hat{G}$ is that, on the average, a department
with specific mean test scores v, q, and u should have, according to
the fitted model, the following regression coefficients for the
constant, GRE-V, GRE-Q, and UGPA:

    $B^*_{i0} = 1.60 - .66v - .70q - .53u$,
    $B^*_{i1} = 1.20 + .24v - .25q - .26u$,
    $B^*_{i2} = .03 + .08v + .34q - .25u$,
    $B^*_{i3} = -.29 - .15v - .01q + .41u$.

Of course, actual departments with these mean test scores will
have true regression coefficients that differ from those above
because of the variation between departments, expressed by the error
component in equation (3). Empirical Bayes tries to estimate the
true coefficients by combining the information in the least squares
estimates and the "typical estimates" $B^*_{i0}$, $B^*_{i1}$, $B^*_{i2}$, $B^*_{i3}$.
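
The four equations for the typical estimates amount to a single matrix product. The sketch below arranges their numerical coefficients as a 4x4 array and evaluates them for given mean scores; the function name `typical_coefficients` and the input values are illustrative assumptions, and the inputs are not meant to represent a real department.

```python
import numpy as np

# Rows: multipliers of 1, v, q, u. Columns: typical coefficient for the
# constant, GRE-V, GRE-Q, and UGPA. Values are copied from the four
# equations in the text.
G_hat = np.array([
    [ 1.60,  1.20,  0.03, -0.29],
    [-0.66,  0.24,  0.08, -0.15],
    [-0.70, -0.25,  0.34, -0.01],
    [-0.53, -0.26, -0.25,  0.41],
])

def typical_coefficients(v, q, u):
    """Typical regression coefficients for a department with mean scores v, q, u."""
    z = np.array([1.0, v, q, u])
    return z @ G_hat

# Illustrative inputs only.
print(typical_coefficients(1.0, 1.0, 1.0))
```

Setting all three means to zero returns the first row of the array, that is, the constant terms of the four equations.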

Interestingly, the diagonal elements of the estimate of $\Sigma^*$
are rather small, compared to the estimated variance of the least
squares estimates. Accordingly, the empirical Bayes estimate of $B_i$
falls nearer to the typical estimate $B^*_i$ than to the least squares
estimate of $B_i$. In other words, the apparent variability in the
least squares estimate is so large that the empirical Bayes compromise
leads to shrinking the least squares estimate almost all
the way to the plane. This is illustrated schematically in one
dimension in Figure 1. Each department's estimated prediction
equation under empirical Bayes remains unique because each department
possesses a unique set of mean test scores. However,
two departments with the same mean test scores would obtain
estimated prediction equations from empirical Bayes that would
be nearly the same. That is, the differences between the least
squares prediction equations for the two departments are largely
attributed to random variation and are eliminated by the empirical
Bayes process. It should be emphasized that these findings are
particular to this study; in other applications, the precision of
the individual unit's least squares estimate may be comparably
greater and, as a consequence, have a greater direct influence on
the corresponding empirical Bayes estimate.

Inspection of $\hat{G}$ reveals that, in general, we should expect that
the coefficient in the departmental prediction equation associated
with a particular test score will increase as that department's mean
test score increases, that is, the diagonal elements of $\hat{G}$ are all
positive. Furthermore, that same coefficient will decrease as the
other mean test scores increase, though this effect tends to be
smaller. Since the intercept decreases with increasing mean test
score, we may say roughly that, as the mean test scores increase
across a series of departments, the corresponding prediction planes
become somewhat steeper and, as there appears to be no association
between the size of the slope and the estimate of residual variance,
the quality of the fit becomes somewhat better.

Figure 2: Scatterplot of Intercept Coefficients for Models: All Departments

The diagonal elements of $\hat{\Sigma}^*$ clearly show that across departments,
$B_{i0}$ displays the most dispersion, followed by $B_{i3}$, $B_{i1}$, and $B_{i2}$, in that
order. The correlation matrix derived from $\hat{\Sigma}^*$ also provides useful
information. We note that the correlations among the intercept, the
verbal and quantitative coefficients are all positive and modest in
size compared to the rather large negative correlations they exhibit
with the UGPA coefficient. Although we have not developed a formal
model to explain this pattern, it is very likely an outgrowth of the
selection processes operating at the departmental level. For example,
such a pattern would likely develop if admissions were based more
heavily on the GRE or UGPA, but not both, in most departments.
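
The correlation form in Table 2 can be recovered from the covariance entries by dividing each covariance by the product of the two standard deviations. The short sketch below does this for the printed values; the variable names are arbitrary, and because the printed entries are rounded, the results should agree with the table only to within rounding.

```python
import numpy as np

# Estimated covariance matrix from Table 2, in units of 10^-2, ordered as
# constant, GRE-V, GRE-Q, UGPA. The common scale factor cancels below.
Sigma = np.array([
    [ 9.74,  1.01,  0.07, -3.85],
    [ 1.01,  0.60,  0.04, -0.83],
    [ 0.07,  0.04,  0.17, -0.20],
    [-3.85, -0.83, -0.20,  2.03],
])

sd = np.sqrt(np.diag(Sigma))      # standard deviation of each coefficient
corr = Sigma / np.outer(sd, sd)   # divide each entry by sd_i * sd_j
print(np.round(corr, 2))
```

The ordering of the diagonal entries of `Sigma` (9.74, 2.03, .60, .17) is also what underlies the statement about which coefficient is most dispersed across departments.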

To develop an appreciation for the effect of the empirical Bayes
approach, we display in Figure 2 a scatterplot of the least squares
against the empirical Bayes (EBFf) estimates of the intercepts of the
within-department prediction planes. Although the two sets of estimates
are strongly associated, the dispersion of the former is considerably
greater than that of the latter, the ratio of the standard deviations
of the marginal distributions being about five to one. These aspects
of the empirical Bayes models are poor substitutes for more informative
ways of looking at eight-dimensional space. Unfortunately, such views
are difficult to capture on two-dimensional surfaces.
3.5 Model Selection 

The selection of one among many complex models in this setting
involves a number of criteria related to both goodness-of-fit and
stability, as well as nonstatistical considerations. Our approach
has been to screen the available empirical Bayes models with a
view to winnowing out all but two or three, and then to compare the
performance of those remaining with that of least squares on the
basis of cross-validation studies. We shall present our analysis
in approximate historical sequence.

Because empirical Bayes estimates for a given department are not
just based on data from that department alone, they do not appear to
fit the data as well as least squares estimates based solely on the
department's data. For example, least squares estimates have the
property that the mean residual over all students in the department
must be zero. That is, the average predicted FYA and the average
observed FYA are equal. In general, empirical Bayes estimates do not
share this property. Thus, one dimension along which we can compare
different empirical Bayes models is how close they come to the ideal
of producing zero-mean residuals.
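
The zero-mean property of least squares residuals, and its loss under shrinking, can be checked numerically. The sketch below uses simulated data for one hypothetical department and an arbitrary pull toward a fixed vector as a stand-in for an empirical Bayes estimate; none of the numbers come from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30                                           # hypothetical department size
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
y = X @ np.array([3.0, 0.2, 0.1, 0.3]) + rng.normal(scale=0.4, size=m)

# Least squares fit with an intercept column.
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stand-in for a shrunken estimate: halfway toward an arbitrary target vector.
b_shrunk = 0.5 * b_ls + 0.5 * np.array([3.2, 0.1, 0.1, 0.1])

mean_resid_ls = np.mean(y - X @ b_ls)
mean_resid_shrunk = np.mean(y - X @ b_shrunk)
print(mean_resid_ls, mean_resid_shrunk)
```

The first mean residual is zero up to floating-point error because the design includes a constant column; the second generally is not, which is why the departmental mean residual is a usable yardstick for comparing shrinkage models.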

Three models were chosen: EBCp, EBFf, and EBFq'. Although only
EBCp makes use of the cluster structure, for each method the mean
residuals by department are grouped by cluster for purposes of comparison.
The mean residual for a department is the difference between
the average observed FYA and the average predicted FYA derived from
the method under study. Figure 3 displays box-and-whisker plots (Tukey,
1977) of these sets of mean residuals for three of the clusters. As
one would hope, the distribution of mean residuals in each cluster-method
combination is approximately centered about zero. However,
there are large differences in variability among the distributions.
Since less variability is to be preferred, EBCp must clearly be
rejected as a viable alternative. Apparently, the heterogeneity
among departments within a cluster is sufficient to preclude a
simple shrinking-to-a-point model from performing adequately.

NOTE: A box-and-whisker plot displays five important characteristics
of a distribution: the minimum, lower quartile, median, upper
quartile, and maximum.

EBFf performs somewhat more poorly than EBFq' but requires fewer
parameters. Since the ultimate goal of the analysis is to produce
models that can predict FYA well, we prefer to consider other aspects
of the models before deciding to discard one or the other. One such
feature is the scatter of the empirical Bayes estimates of the vector
of regression coefficients about the surface from which the true
vectors were apparently generated. Excessive variability or systematic
patterns in the scatter may indicate deficiencies in the formulation of
the higher level of the model, equation (6).

We have chosen to consider two statistics. Let

    $g_i = \lVert \tilde{B}_i - Z_i' \hat{G} \rVert$

be the Euclidean distance (in parameter space) between the point representing
the empirical Bayes estimate of the true vector of regression
coefficients for the department and the corresponding point on the
estimate of the surface from which these vectors were generated.
Since $g_i$ gives equal weight to each component, we also constructed
the statistic

    $h_i = (\tilde{B}_i - Z_i' \hat{G})' Z_i$,

which measures the difference in predicted FYA for a student with average
test scores, obtained by using $\tilde{B}_i$ or $Z_i' \hat{G}$. Here the componentwise
differences are weighted by the mean scores in the department. Although
$g_i$ and $h_i$ may be studied in isolation, we decided to plot them against
the quantity $d_i$, which is computed from $(Z_{i1}, Z_{i2}, Z_{i3})'$, the vector of
mean scores in the department on GRE-V, GRE-Q, and UGPA, and from C,
which is obtained by taking the median, componentwise, of these means
across departments in the cluster. Thus, the first component of C is
the median of the average verbal scores for the departments in the
cluster. Large values of $d_i$ indicate that department i is atypical,
in some sense, of the departments in the cluster.
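
As a rough illustration of these diagnostics, the sketch below computes $g_i$ and $h_i$ as described, together with one possible form of $d_i$. The exact definition of $d_i$ is not recoverable from the text here, so the Euclidean distance from the cluster median used below is an assumption, as are the function name `fit_diagnostics` and the toy inputs.

```python
import numpy as np

def fit_diagnostics(B_eb, Z, G_hat):
    """Scatter diagnostics for the departments in one cluster.

    B_eb:  (n, 4) empirical Bayes estimates of the coefficient vectors
    Z:     (n, 4) departmental covariates with a leading constant
    G_hat: (4, 4) fitted hyperparameter matrix
    """
    surface = Z @ G_hat                        # typical coefficients per department
    diff = B_eb - surface
    g = np.linalg.norm(diff, axis=1)           # Euclidean distance to the surface
    h = np.sum(diff * Z, axis=1)               # difference in predicted FYA at the means
    center = np.median(Z[:, 1:], axis=0)       # C: componentwise median of mean scores
    # ASSUMPTION: d_i taken as Euclidean distance from C.
    d = np.linalg.norm(Z[:, 1:] - center, axis=1)
    return g, h, d

# Toy inputs, chosen only so the results are easy to check by hand.
Z = np.array([[1.0, 0, 0, 0], [1.0, 1, 1, 1], [1.0, 2, 2, 2]])
offsets = np.array([[0.0, 0, 0, 0], [3.0, 4, 0, 0], [0.0, 0, 0, 1]])
g, h, d = fit_diagnostics(Z @ np.eye(4) + offsets, Z, np.eye(4))
print(g, h, d)
```

Plotting `g` or `h` against `d`, cluster by cluster, would reproduce the kind of display discussed in the surrounding text.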

Plots of $g_i$ or $h_i$ against $d_i$ may indicate whether there are
systematic patterns in the quality of fit of the model. For this
reason, we construct these plots separately for each cluster, even
though the empirical Bayes model we employ may not utilize the cluster
structure at all. We could more easily construct a simple plot aggregating
over all clusters. But the present approach will allow us to
detect difficulties at the cluster level.

Plots of $g_i$ against $d_i$ were generated for three models: EBCf,
EBFf, and EBFq'. Interestingly, there were no systematic patterns
evident except for a tendency for large values of $d_i$ (atypical departments)
to be associated with small values of $g_i$. Figure 4 provides
a representative illustration. On the other hand, the values of $g_i$ for
EBFf were considerably larger than for the other two models. EBFq'
performed somewhat more poorly than EBCf, but the latter requires many
more parameters. Similar comparisons were carried out on the basis of
plots of $h_i$ against $d_i$. Differences among methods were somewhat reduced,
and again there were no evident systematic patterns. Consequently, no
decision to reduce the number of models being considered was made at
this point.




Figure 4: EBCf for Humanities Cluster

Since EBFq' does involve the estimation of a large number of
parameters, we explored the possibility of eliminating some of the
covariates appearing in (6). In particular, using slight generalizations
of the sensitivity analyses advocated in Belsley, Kuh, and Welsch
(1980), we were able to demonstrate the redundancy of the cross-product
terms $Z_{i1} Z_{i2}$, $Z_{i1} Z_{i3}$, $Z_{i2} Z_{i3}$. Consequently, we began the series
of cross-validation tests with four models: least squares, EBCf,
EBFf, EBFq. Among the empirical Bayes models considered, EBFf is the
most parsimonious in terms of the number of parameters to be estimated.
3.6 Cross-Validation

Cross-validation as a model selection technique has a long and
honorable history (Stone, 1978). Essentially, the idea is to use a
model fitted on one set of data to predict the results for an independent
set of data. In practice, replicate data sets are rare, and the
usual remedy has been to split the existing data set in half, fitting
the model on one half and using it to make predictions for the other
half. Of course, the drawback is that the estimated coefficients will
be more variable than if they were fitted using the entire data. Nonetheless,
the method gives a good indication of which models are too
sensitive to artifacts in the data at hand.

Our approach has been to split each department in half at random
and, thus, to construct two half-samples of the full validity data base
at the departmental level. One half-sample is selected to be the
calibration sample; least squares and three empirical Bayes methods
are applied to it. The resulting fitted models, together with the
predictor information for students in the other half-sample, the
cross-validation sample, are used to predict FYAs that are compared
to the actual FYAs obtained by those students. Then the roles of
the two half-samples are reversed and the process repeated.
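
The split-half scheme for a single department can be sketched as follows. Only the least squares arm is shown, since the empirical Bayes fits require the full family of departments; the helper names `split_half` and `cross_validate` and the simulated data are illustrative assumptions.

```python
import numpy as np

def split_half(n_students, rng):
    """Randomly split the student indices of one department into two halves."""
    order = rng.permutation(n_students)
    half = n_students // 2
    return order[:half], order[half:]

def cross_validate(X, y, rng):
    """Fit least squares on each half and predict the other half.

    X: (m, 4) predictors with a leading constant; y: (m,) observed FYAs.
    Returns cross-validated predictions aligned with y.
    """
    a, b = split_half(len(y), rng)
    y_hat = np.empty_like(y, dtype=float)
    # Each half serves once as calibration sample and once as
    # cross-validation sample.
    for fit_idx, pred_idx in ((a, b), (b, a)):
        coef, *_ = np.linalg.lstsq(X[fit_idx], y[fit_idx], rcond=None)
        y_hat[pred_idx] = X[pred_idx] @ coef
    return y_hat

# Simulated department of 20 students (hypothetical data).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = X @ np.array([3.0, 0.2, 0.1, 0.3]) + rng.normal(scale=0.4, size=20)
print(cross_validate(X, y, rng)[:5])
```

With only ten students per half and four coefficients to estimate, the half-sample least squares fits are quite unstable, which is the situation the empirical Bayes methods are meant to improve.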

In carrying out the construction of the half-samples, we eliminated
departments with fewer than 10 students so that each half-sample would
have at least five students. As a result, only 142 of the original 190
departments (6,340 students) were included in the cross-validation.
Table 1 displays the breakdown by cluster.

There are a number of ways to assess success in prediction. In
the area of measurement, it is common to correlate the predicted FYAs
with the observed FYAs. The higher the correlation, the more useful
the predictions are held to be. While this may be reasonable in some
settings, it is also true that some systems that produce very large
errors of prediction may nevertheless yield high correlations with the
observed scores. It should also be noted that in the cross-validation
setting, the familiar relation between the square of the correlation
and the proportion of variance explained no longer holds. Thus, we
prefer quantities related to the mean squared error of prediction as
our criteria for determining the usefulness of a model.
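
A small numerical example shows why a high correlation does not guarantee accurate predictions. The values below are invented for illustration: the predictions are an exact linear function of the observed scores, so the correlation is perfect, yet every prediction is far from its target.

```python
import numpy as np

y = np.array([2.8, 3.0, 3.3, 3.6, 3.9])   # hypothetical observed FYAs
y_hat = 2.0 * y - 1.0                      # predictions on a distorted scale

corr = np.corrcoef(y, y_hat)[0, 1]
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(corr, rmse)
```

Here the correlation equals one while the root mean squared error exceeds two grade points, which is the kind of discrepancy that motivates error-based criteria.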

We have employed two related measures of agreement. The first,
root mean square deviation (RMSD), is commonly used by statisticians.
Suppose there are m students in a half-sample of the department. Let $y_j$
denote a generic obtained FYA and $\hat{y}_j$ a generic estimated FYA based on
coefficients from the other half-sample as well as the predictor scores
of the student. Then,

    $\mathrm{RMSD} = \left[ \frac{1}{m} \sum_{j=1}^{m} (y_j - \hat{y}_j)^2 \right]^{1/2}$.

Like the familiar standard deviation, it assesses the prediction error
on the same scale the data are measured on.

A second measure is the difference between predicted and actual
$R^2$, which we will denote by $DR^2$. In regression analysis, $R^2$
is the proportion of variance in the criterion that is explained by
the regression. Given a fitted model, one can calculate what the $R^2$
would be for a new data set following the same model but with a different
set of predictor scores. We call this the predicted $R^2$. More
precisely, we obtain the predicted $R^2$ for a regression derived
through some procedure (least squares, empirical Bayes, and the like)
applied to the calibration sample by assuming that the least squares
fit to the cross-validation sample will yield exactly the same
characteristics (slopes and residual variance). Combining those
characteristics with the distribution of predictor scores in the
cross-validation sample yields a value of $R^2$ for the putative least squares
fit. We are acting as if the fitted regression to the calibration data
will provide a good estimate of the least squares fit to the
cross-validation data. This places, perhaps, an extra burden on the empirical
Bayes fits. With the new data in hand, the variation in the residuals,
$y - \hat{y}$, can be compared to the variation in y. The actual $R^2$ is

    $1 - \dfrac{\mathrm{RMSD}^2}{\mathrm{variance}(y)}$.

This is something of a misnomer since the actual $R^2$ compares the size
of the mean squared error of prediction to the variance of the criterion.
It will be negative when the predictions $\hat{y}$ are inferior to the mean.

The $DR^2$ is defined as

    $DR^2 = \text{predicted } R^2 - \text{actual } R^2$.

A positive value of $DR^2$ indicates that the actual performance of the
fitted model is not as good as would have been expected from the
original data. One would expect such behavior of estimates based on
least squares because of the potentially great effect that idiosyncrasies
of the particular sample may have on the least squares
estimates. On the other hand, superior procedures should exhibit
$DR^2$ values approximately symmetrically distributed about zero with
only modest dispersion. That is, the predicted performance is a
reliable estimate of the actual performance. In a sense, RMSD measures
how accurate the predictions are, while $DR^2$ measures how
well we can guess, a priori, at the quality of those predictions.
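
The two criteria can be sketched in a few lines. The RMSD and actual $R^2$ follow the formulas in the text; the computation of the predicted $R^2$ is one reading of the verbal description (explained variance implied by the calibration slopes and the predictor covariance in the cross-validation sample, relative to explained plus residual variance) and should be treated as an assumption, as should the choice of divisor m for all variances and the function names.

```python
import numpy as np

def rmsd(y, y_hat):
    """Root mean square deviation between observed and predicted FYAs."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def actual_r2(y, y_hat):
    # ASSUMPTION: variance with divisor m, so that predicting the sample
    # mean for every student gives exactly zero.
    return 1.0 - rmsd(y, y_hat) ** 2 / np.var(y)

def predicted_r2(slopes, resid_var, X_cv):
    """R^2 implied by calibration slopes and residual variance, combined
    with the predictor distribution of the cross-validation sample.

    X_cv: (m, k) predictor scores without a constant column.
    """
    S = np.atleast_2d(np.cov(X_cv, rowvar=False, bias=True))
    explained = slopes @ S @ slopes
    return explained / (explained + resid_var)

def dr2(slopes, resid_var, X_cv, y, y_hat):
    """Predicted R^2 minus actual R^2."""
    return predicted_r2(slopes, resid_var, X_cv) - actual_r2(y, y_hat)

y = np.array([1.0, 2.0, 3.0, 4.0])          # toy criterion values
print(actual_r2(y, np.full(4, y.mean())))   # predicting the mean
print(actual_r2(y, y[::-1]))                # predictions worse than the mean
```

The second call returns a negative value, illustrating the remark that the actual $R^2$ falls below zero when the predictions are inferior to the mean.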

Because of the volume of information generated in this segment of
the analysis, only a small representative sample of the findings can
be presented. As before, we display the results grouped by cluster so
that particular patterns can be discerned. In all five clusters, least
squares estimates perform more poorly on both measures of accuracy of
prediction than all three empirical Bayes estimates. On RMSD, least
squares is typically about 10 percent larger than the empirical Bayes
methods, which are nearly equivalent. Figure 5 contains box-and-whisker
plots that schematically display key features of the distributions of
RMSD values for the four methods for two clusters, humanities and physical
sciences. Note that in each cluster, the maximum RMSD for least
squares is exceedingly large and certainly dominates the maximum
exhibited by the empirical Bayes methods.

Figure 5: Box-and-Whisker Plots of Root Mean Square Deviations (RMSD)
for Four Models. Cross-Validation Analysis

Because the RMSD is measured in units of FYA, its magnitude can
be assessed for practical significance. The typical value of RMSD
seems to be about one standard deviation, corresponding to a fair
amount of difference between predicted and observed. The largest
values, about two standard deviations, are indicative of essentially
useless predictions.

As we have noted above, EBCf, EBFf, and EBFq are nearly equivalent,
with EBCf perhaps being somewhat worse. Since EBFf is the most
parsimonious in terms of the number of parameters to be estimated, it
is preferred.

Consideration of $DR^2$ leads even more unambiguously to the
same conclusion. Figure 6 presents box-and-whisker plots for the
humanities and physical sciences clusters of the distributions of $DR^2$
for the four methods. The inferiority of least squares is apparent.
The predicted $R^2$ tends to grossly overestimate the actual $R^2$, resulting
in large positive values of $DR^2$. In fact, more than three-quarters
of the values are positive. By contrast, the distribution of $DR^2$ for
the empirical Bayes methods is more nearly symmetric about zero
although EBCf does have a tendency to produce some modestly large
values. The superiority of EBFf is evident. Given our definition
of predicted $R^2$, this result is slightly surprising. Nevertheless,
its clear implication is that empirical Bayes fits were more reproducible
than least squares fits. Simply put, the predicted FYAs from
empirical Bayes were closer to the mark than those of least squares,
and the quality of the fit could itself be more accurately predicted
for empirical Bayes.

Figure 6: Box-and-Whisker Plots of $DR^2$ for Four Models.
Cross-Validation Analysis

3.7 Supplementary Analyses

Of course, we have by no means exhausted the family of prediction
systems with which empirical Bayes can be compared. For example, it
has been suggested (Wilson, 1982) that the data from all departments
in a given discipline be pooled and a single least squares regression
plane be fitted. We have carried out some exploratory analyses in
this direction by considering four disciplines: psychology, economics,
chemistry, and physics. Using the same cross-validation scheme
described above, we compared the predictions of EBFf with those of the
pooled least squares plane for each of the four disciplines. In each
case EBFf proved superior. Another drawback to this kind of pooling
is that many disciplines are represented by only a few departments
in the Validity Study Service data so that further pooling across
disciplines is required.

Another approach consists of grouping departments by the characteristics
of their students. One suggestion (Burton, 1982) is to use
the difference between mean GRE verbal and mean GRE quantitative scores.
A simple scheme involves the formation of two clusters, according to
whether this difference is negative or positive. The data within each
cluster are pooled and a single least squares regression plane is
fitted for each cluster. Again, empirical Bayes performed better
overall in cross-validation.

Finally, we investigated the possibility of including among the
departmental covariates other features of the distribution of predictor
scores of the students such as variances and covariances. An important
motivation for this step was the concern that apparent differences
between departments in the magnitude of the coefficients in the prediction
equations may be due to variation in the amount of restriction
of range experienced. However, the inclusion of predictor score
variances did not result in any improvements. Thus, it does not
appear that our conclusions have been driven by differential restriction
of range.

4. Subgroup Analyses

4.1 Introduction

One question of considerable interest concerns the possible differential
efficacy of the GRE battery or UGPA as predictors of FYA
across various subgroups of the population. These subgroups may be
defined by sex, race, age, mother tongue, or some combination of
these factors. Classification of students by these factors, however,
further exacerbates the problem of small sample sizes. In fact, it
is usually impossible to obtain separate prediction equations for
each group in each department by ordinary least squares methods.

Bayesian methods, and empirical Bayes methods in particular,
provide a reasonable solution to this problem. In the context of
the Graduate Management Admission Test, Braun and Jones (1981)
demonstrated that separate prediction equations for White students
and Black students could be estimated in each institution, even
though Blacks were less than 6 percent of the cohort. Technically,
the extension of the model of Section 3 to encompass this application
is very simple.

Suppose, for example, that students are categorized according
to two factors, each at two levels. For definiteness, suppose these
factors are sex (male or female) and age (less than 25, more than 25).
We then assume that

(8)    $Y_{ij} = B_{i0} + B_{i1} V_{ij} + B_{i2} Q_{ij} + B_{i3} U_{ij}$
       $\qquad + I^1_{ij} (B_{i4} + B_{i5} V_{ij} + B_{i6} Q_{ij} + B_{i7} U_{ij})$
       $\qquad + I^2_{ij} (B_{i8} + B_{i9} V_{ij} + B_{i,10} Q_{ij} + B_{i,11} U_{ij}) + e_{ij}$,

where

    $I^1_{ij}$ = 1, if student is female
               0, if student is male

and

    $I^2_{ij}$ = 1, if student is more than 25
               0, if student is less than 25.

As before, i indexes departments and j indexes students within departments.
The higher level of the model remains the same. The expanded
vector of coefficients $B_i = (B_{i0}, B_{i1}, \ldots, B_{i,11})'$ is assumed to follow a
regression of the form $B_i = Z_i' G + \delta_i$. Equation (8) allows us to fit a
different prediction plane for each sex-age combination, the only
restriction being that any one of the planes can be linearly determined
from the other three. (It would require three indicator functions to
fit four independent planes. Given the large number of coefficients
to be estimated, this extension was not followed through.)

The indicator functions $I^1$ and $I^2$ determine which set of coefficients
comes into play. For example, for males over 25, $I^1 = 0$ and
$I^2 = 1$, so the plane is determined by $B_{i0}, \ldots, B_{i3}$ and $B_{i8}, \ldots, B_{i,11}$.
For females over 25, $I^1 = 1$ and $I^2 = 1$, so the plane is
determined by all twelve coefficients.

Another way to think of the process is to imagine that the first
four coefficients determine the basic prediction plane for males under
25. The other two sets of four coefficients represent the modifications
that must be made to the basic plane to account for differences for
females or older students, or both.
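
One way to see how the indicator functions select coefficients is to write out the twelve-column design row for a single student. The sketch below does this; the function name `design_row`, the ordering of the two modification blocks, and the predictor values are assumptions made for illustration.

```python
import numpy as np

def design_row(v, q, u, female, over_25):
    """Twelve-column design row for the two-indicator model.

    Columns 0-3 multiply the base plane (males under 25), columns 4-7 the
    modification for females, and columns 8-11 the modification for
    students over 25. female and over_25 are 0/1 indicators.
    """
    base = np.array([1.0, v, q, u])
    return np.concatenate([base, female * base, over_25 * base])

# Hypothetical male student over 25; predictor values are illustrative.
row = design_row(2.5, 3.0, 3.4, female=0, over_25=1)
print(np.nonzero(row)[0])
```

For this student only the base block and the age block are active, so eight of the twelve coefficients enter the prediction; for a female student over 25 all twelve would.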

A more conservative approach is to fit a model in which the separate
planes are constrained to be parallel. Such a model would take the form

(9)    $Y_{ij} = B_{i0} + B_{i1} V_{ij} + B_{i2} Q_{ij} + B_{i3} U_{ij} + I^1_{ij} B_{i4} + I^2_{ij} B_{i5} + e_{ij}$.

The inclination of the plane remains the same for all groups, and only
its height may vary. A more complete treatment of fitting empirical
Bayes models to sparse data may be found in Braun et al. (1983).

Model EBFq was employed as the basis of our studies in this section.
Thus the higher level of the model may be represented by a
quadratic surface determined by linear and pure quadratic terms. When
the departmental regression takes the form (9), we denote the model
by EBFqe; when it takes the form (8), we denote the model by EBFqe'.
4.2 Results 

To examine the viability of differential prediction systems, we
chose to group individuals by age and sex as described above. At
first, to eliminate confounding effects, we selected only individuals
who were White and for whom English was the primary language of communication.
Unfortunately, this reduced the total sample by about 60
percent. We, therefore, abandoned this selection and employed all
students with complete information on FYA, GRE verbal and quantitative
scores, UGPA, age, and sex. The total number of students was 5,491,
somewhat lower than the original 6,946. Table 3 displays the breakdown
by sex and age for each half-sample. Again, departments with fewer
than 10 students were excluded from the final cross-validations.

Half-samples were generated as before, with the eligible students
in each department being divided into two equal groups. The
models to be compared (least squares, EBFq, EBFqe, and EBFqe') were
each fitted to one half of the data base and then applied to the
other half. Only the RMSD measure was calculated. In general, the
performance of the empirical Bayes models was superior to that of
least squares, but EBFq, which does not take account of age or sex,
did better than EBFqe and EBFqe', which do. Thus, on the basis of
the cross-validation, there is no reason to employ separate prediction
planes for the different age-sex groups.
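
The half-sample comparison can be sketched in a few lines. This is a simplified illustration, not the procedure used in the study: it uses a single predictor and ordinary least squares, splits students by alternating positions rather than at random, and reads the reported measure as a root mean squared deviation between actual and predicted FYA. All function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rmsd(actual, predicted):
    """Root mean squared deviation between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def half_sample_rmsd(xs, ys):
    """Fit on one half-sample, predict the other, and return the RMSD."""
    # Alternate students between halves; a random split would serve equally well.
    fit_x, fit_y = xs[0::2], ys[0::2]
    new_x, new_y = xs[1::2], ys[1::2]
    a, b = fit_line(fit_x, fit_y)
    return rmsd(new_y, [a + b * x for x in new_x])
```

A competing model is judged by the same held-out RMSD, so a model with more parameters gains nothing unless its extra terms carry over to the second half-sample.
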

On the other hand, there appears to be a consistent pattern in
the fitted coefficients that may merit further investigation. Under
model EBFqe, for a given set of predictor scores, the predicted FYA
for females tended to be higher than that for males, while the
predicted FYA for those over 25 tended to be higher than that for
younger students. The former effect was much more pronounced than
the latter. As noted above, neither was borne out in the
cross-validation, so we must conclude that a sex effect, if it exists,
must be estimated by using more delicate methods.

TABLE 3

Counts of Students with Complete Information on GRE-V, GRE-Q, UGPA,
FYA, Age, and Sex

                          Half-Sample 1             Half-Sample 2
                      # Depts.  # Students      # Depts.  # Students
Total                    121       2,747           121       2,744
Biological Sciences       19         347            19         341
Humanities                --          --            --          --
Physical Sciences         22         505            22         506
Psychology                20         390            20         387
Social Sciences           40       1,225            40       1,227

                  Half-Sample 1                Half-Sample 2
              Males   Females   Total      Males   Females   Total
Age < 25      1,253       994   2,247      1,199     1,054   2,253
Age > 25        187       313     500        201       290     491
Total         1,440     1,307   2,747      1,400     1,344   2,744

The analysis was continued by considering differential prediction
equations by race. Unfortunately, race was known for only about
35 percent of the original sample. We employed three categories:
Whites, Asians and Oriental Americans, and other minorities
(principally Blacks). These constituted approximately 88 percent,
2 percent, and 10 percent, respectively, of the reduced sample.

The half-samples were constructed as before and the
cross-validation analyses showed again that empirical Bayes outperformed
least squares, with EBFq proving superior to EBFqe and EBFqe'.
Thus, there appears to be no justification for employing different
prediction planes. Consideration of the coefficients in EBFqe
indicates a tendency for the predicted FYAs of Asian-American
students to be somewhat higher than those of the other students with
the same predictor scores.

Our analysis has been carried out in the context of the empirical
Bayes formulation. It is quite conceivable that another prediction
system, involving more pooling of data, might yield other conclusions,
particularly with regard to the question of different prediction
equations by race. In this setting, the small sample sizes for minority
groups perhaps require that further constraints be placed on the
fitted models. More work in this direction should be carried out.

5. Clustering

5.1 Introduction

As we noted in Section 1, the formation of clusters of departments
has always played an important role in the study of the validity of
various predictors of graduate school performance. Such clusters are
often created by grouping departments by discipline or by sets of
related disciplines. Another approach has been to group departments
by their location along a verbal-quantitative axis (Wilson, 1979). As
we have seen in Section 3, use of five broad clusters of departments
did not improve the performance of the empirical Bayes methods. It
could be argued that, despite the use of extended models that can
accommodate heterogeneity, the five clusters are too diverse and that
perhaps 10 or even 15 clusters would be more appropriate. Thus, one
would be fitting EBCf, which would require a separate plane (cf.
Section 3.1) to be estimated for each cluster. However, with
n clusters, the number of regression parameters to be fitted is
n times the number required by EBFf. Since n = 5
did not prove sufficient, it is difficult to believe, on grounds of
stability, that higher values of n would be beneficial.

Consequently, we approached the problem from a different
perspective. We had information available to us (see Section 2) on the
characteristics of the applicants to all departments in the United
States for a given year, summarized by department type (discipline).
Our aim was to group disciplines by the similarity of the
characteristics of their applicants. (Unfortunately, corresponding data for
admitted students or enrolled students were not available.) We employed
two different sets of data. The first involves the distribution of
types of GRE Subject Tests taken by the applicants and the scores
they achieved. Thus, with each discipline is associated a two-factor
contingency table containing counts of applicants sorted by test and
test score. An example is given in Table 4.

Although there are published methods for carrying out hierarchical
clustering of nominal data (Hartigan, 1975; BMD-P-77), we were not
satisfied with their properties and devised our own method (Braun &
Jones, 1982). The essential difference in our method is that, as we
form clusters, we do not pool the data over the disciplines in the
cluster but retain all the information for each member. A likelihood
ratio statistic is employed to determine which disciplines form the
most homogeneous groups with respect to the distribution of the counts
in the associated contingency table.
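
To give a sense of what a likelihood ratio measure of homogeneity looks like, the sketch below computes the standard G-squared statistic for a set of count distributions, one row per discipline, with each discipline's contingency table flattened into a single row of cells. This is the textbook statistic, offered as an assumption about the general idea; it is not a reproduction of the Braun and Jones (1982) procedure, whose details are not given here.

```python
import math

def g2_homogeneity(rows):
    """Likelihood ratio (G^2) statistic for homogeneity of several count rows.

    rows: list of equal-length lists of non-negative counts, one row per
          discipline and one column per category.
    """
    col_totals = [sum(col) for col in zip(*rows)]
    grand = sum(col_totals)
    g2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            if observed > 0:  # a zero count contributes nothing
                expected = row_total * col_total / grand
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2
```

The statistic is zero when every discipline has the same proportions in every cell and grows as the distributions diverge, so small values point to disciplines that could reasonably share a cluster.
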

TABLE 4

Distribution of Students by Advanced Test Taken and Test Score: Genetics

                              Distribution of Scores
Test Name      200-  300-  400-  500-  600-  700-  800-  900-  Total No.
               290   390   490   590   690   790   890   990   of Scores
ADVANCED TESTS
BIOLOGY          -     6    32   114   239   243    69     3      706
CHEMISTRY        -     -     -     4     6     1     -     -       11
EDUCATION        1     -     1     1     1     -     -     -        4
ENGINEERING      -     -     2     -     -     -     -     -        2
FRENCH           -     1     1     1     -     -     -     -        3
GEOLOGY          -     -     -     1     -     -     -     -        1
LITERATURE       -     -     1     -     -     -     -     -        1
MATHEMATICS      -     -     1     -     -     1     -     -        2
MUSIC            -     -     1     1     -     -     -     -        2
PHYSICS          -     -     -     -     2     -     -     -        2
PSYCHOLOGY       -     -     2     1     2     -     -     -        5
SOCIOLOGY        -     -     1     -     -     -     -     -        1
TOTAL            1     7    42   123   250   245    69     3      740

The second set of data we used, which comes from the same source,
provides the scores on GRE verbal and quantitative for all applicants
to the departments in a discipline. Unfortunately, only the marginal
distributions and not the joint distributions were available. That
is, for example, we knew how many had verbal scores between 300 and
600 and how many had quantitative scores between 600 and 700, but
not how many fell into both categories simultaneously. However, we
were able to estimate this joint distribution by assuming that the
correlation between verbal and quantitative scores in each discipline
equalled the correlation in the entire applicant pool, a known
quantity. With each discipline we were able to associate a new two-factor
contingency table containing approximate counts of applicants, sorted
by their verbal and quantitative scores. The hierarchical clustering
procedure alluded to above was also applied to these data.

5.2 Results 

Inasmuch as the results of the clusterings using the two data
sets were quite similar, we will illustrate our findings with the
GRE Subject Test data. For this example, only information on the type
of Subject Test taken (and not the score on the test) was employed.
In hierarchical clustering, clusters, once formed, are never split
but rather are combined with other clusters. In Figure 7, we
display the last 13 steps of the algorithm, showing how the 14
groups formed by that stage are combined. For example, at step 13,
chemistry is combined with the medical sciences group. This new
cluster is then linked to a biological sciences group at step 8.
The different disciplines contained within each of the 14 groups are
listed in Table 5. The labels assigned to these groups are only
meant to be suggestive, as there are a number of anomalies present.
For example, mining appears in the paramedical group, well-separated
from geology, while sociology is allied with French, Spanish and
music.
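
The agglomerative logic described above, in which clusters are only ever combined and steps are counted back from the end as in Figure 7, can be sketched as follows. The merge criterion here is average linkage over a user-supplied dissimilarity, which is an assumption made for brevity; the report's own method uses a likelihood ratio criterion without pooling. The function and argument names are hypothetical.

```python
def agglomerate(labels, dissimilarity):
    """Merge clusters pairwise until one remains; return the merge history.

    labels        : list of item names
    dissimilarity : function taking two item names and returning a number
    Clusters are compared by average linkage (an assumption of this sketch).
    Each history entry is (steps_before_end, cluster_a, cluster_b).
    """
    clusters = [(name,) for name in labels]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                cost = sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        # With c clusters remaining, this merge is step c - 1 counted from the end.
        history.append((len(clusters) - 1, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history
```

Under this numbering, 14 groups produce 13 recorded steps, the first labeled 13 and the last labeled 1, which matches the way the steps are described for Figure 7.
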

It appears that these anomalies arise because of small sample
sizes in some disciplines. On the whole, however, the clusters are
quite sensible and provide some useful insight into the similarities
of applicants to the various disciplines. Note that chemistry and
physics are quite widely separated although they are both academic
scientific disciplines, while economics is somewhat removed from the
quantitative sciences. It remains to be seen whether these clusters
provide useful information on the flow of students into graduate
school. Presumably, one would also have to study the structure of the
common applicant pools.

Figure 7. Clusters of Graduate Fields. Numbers indicate the number of
steps before the end of the algorithm at which clusters are joined.
Dashed lines denote the five clusters employed in validity trials.

TABLE 5

Grouping of Disciplines for 14 Clusters in Figure 7

Education
    Educational Administration, Education, Physical Education

Paramedical & Miscellaneous
    Other, Speech, Hospital Administration, Nursing, Audiology,
    Occupational Therapy, Home Economics, Other, Architecture, Mining,
    Business and Commerce, Geography, Law, Urban Development

Psychology
    Educational Psychology, Guidance and Counseling, Social Psychology,
    Anthropology, Other Social Sciences, Industrial Relations,
    Social Work, Psychology

Sociology
    French, Spanish, Sociology, Music

Mathematics
    Applied Mathematics, Statistics, Mathematics, Computer Science

Communications and Humanities
    Linguistics, Other Foreign Languages, Russian, Fine Arts,
    Communications, Archeology, Classical Languages, Religion,
    Far Eastern Languages, Library Science, Journalism, Italian,
    Dramatic Arts, Other Humanities, Art History, American Studies,
    German, Comparative Literature, Philosophy, English

Chemistry
    Chemistry

Biological Sciences
    Genetics, Microbiology, Biology, Anatomy, Optometry, Bacteriology,
    Entomology, Veterinary Medicine, Botany, Zoology,
    Other Biological Sciences, Physiology, Dentistry, Pathology,
    Forestry, Parasitology, Physical Therapy, Public Health, Agriculture

Engineering
    Aeronautical Engineering, Metallurgical Engineering,
    Electrical Engineering, Civil Engineering, Mechanical Engineering,
    Chemical Engineering, Industrial Engineering, Other Engineering

Medical Sciences
    Pharmacology, Nutrition, Biochemistry, Biophysics, Pharmacy,
    Oceanography, Other Physical Sciences

Physics
    Astronomy, Physics

Geology
    Geology

Political Science
    Slavic Studies, International Relations, Public Administration,
    Political Science, Near Eastern Languages, History

Economics
    Economics

5.3 Applications to Validity 

The original purpose of this portion of the analysis was to
produce new clusters to which we could apply models of the form EBCf.
Therefore, we employed the final five clusters yielded by the algorithm
and indicated on Figure 7. However, fitting EBCf to these five clusters
did not produce departmental regression planes with characteristics
superior to those of EBFf. We did not attempt to fit the model to the
full 14-cluster partition displayed in Figure 7.

Our second approach to the problem was to construct a matrix of
distances between disciplines based on the homogeneity of the
distributions of counts in the Subject Test-taken/Test-score matrix associated
with each discipline. The multidimensional scaling program, MDSCAL
(Kruskal & Wish, 1978), was then applied to the distance matrix.
Three-, four- and five-dimensional scalings were produced, but only
the five-dimensional representation was judged minimally acceptable
in capturing the salient features of the data.

The corresponding five coordinates for each discipline were
then added to the departmental level covariates for all departments
in the discipline. Thus, in the empirical Bayes framework we have
associated with each department a number of department-specific
covariates based on average test scores of the students and a number
of discipline-specific covariates based on national applicant
characteristics. An empirical Bayes model was then fitted to the data,
using no cluster information. Once again, the prediction planes
generated for each department proved no better than those derived
from EBFf, which employs no discipline-linked covariates. Thus, our
conclusion is that while the applicant data and clustering methods
we have considered may promise intriguing possibilities, more work
is required before these possibilities are realized.

6. Conclusions

The statistical analyses we have carried out indicate that the
values of the coefficients of a prediction equation for a department
are strongly related to the typical test scores earned by students in
that department. Moreover, this relationship appears to be a linear
one and does not depend on the type of department. Empirical Bayes
methods make use of this structure to obtain estimates of the
coefficients that improve significantly upon those derived by least
squares using individual department data.

That different departments with similar students (in terms of
predictor scores) have similar prediction equations is not a little
surprising, given the variety of departments involved. What is more
surprising, better departments appear generally to have stronger
levels of association between criterion and predictors. Although we
have pretty much ruled out differential restriction of range as a
confounding factor, there are a number of plausible explanations.
For example, grading standards may be more carefully observed, or
academic ability more important for success, in departments with the
more able students. It is an intriguing puzzle that demands the
attention of specialists in higher education. While we cannot shed
much light on this issue, we can make some comparisons between the
results of the least squares and empirical Bayes methods.

Of necessity, the dispersion of the empirical Bayes coefficients
across departments is much less than that for the least squares
coefficients. The cross-validation analyses showed that the empirical Bayes
models produce better predictions for replicate data. Equally
important, the empirical Bayes coefficients in the multiple regressions are
essentially always positive and more stable across replicate data sets.
To demonstrate the comparative stability of coefficients derived through
empirical Bayes, Figure 8 was constructed. One department was selected
at random from each of four major groups of disciplines: humanities,
biological sciences, social sciences, and physical sciences. The
empirical Bayes and least squares estimates based on the half-samples
for each department were obtained, and the absolute value of the
differences for each coefficient was computed. Thus, if the coefficients
for one half-sample of a particular department were $(b_0, b_1, b_2, b_3)$
and for the other half-sample $(b_0', b_1', b_2', b_3')$, then the quantity
computed was the vector $(|b_0 - b_0'|, |b_1 - b_1'|, |b_2 - b_2'|, |b_3 - b_3'|)$.
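
The difference vector just defined is simple to compute. The sketch below does so and adds a componentwise comparison between two estimation methods; the coefficient values are hypothetical numbers chosen only for illustration, and the function names are assumptions.

```python
def half_sample_differences(coefs_1, coefs_2):
    """Componentwise absolute differences between two coefficient vectors."""
    return [abs(a - b) for a, b in zip(coefs_1, coefs_2)]

def no_larger_in_any_component(diffs_a, diffs_b):
    """True if every component of diffs_a is at most that of diffs_b."""
    return all(a <= b for a, b in zip(diffs_a, diffs_b))

# Hypothetical half-sample coefficients (intercept, then three slopes).
eb = half_sample_differences([0.50, 0.20, 0.30, 0.10], [0.45, 0.22, 0.27, 0.12])
ls = half_sample_differences([0.90, 0.05, 0.60, -0.10], [0.30, 0.35, 0.20, 0.25])
print(no_larger_in_any_component(eb, ls))
```

When each difference is drawn as an arm from the origin along one compass direction and the ends are joined, as in Figure 8, both quadrilaterals are convex, so this componentwise condition is equivalent to one quadrilateral lying inside the other.
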

This vector can be represented graphically in a number of ways.
One such way is by means of an icon. Figure 8 illustrates the use of
an asymmetric diamond icon. Essentially, each of the components is
plotted as a vector starting at the origin and extending outward in one
of the compass directions. The ends of these four vectors are
connected to form a quadrilateral. The size and shape of the quadrilateral
indicate both the absolute and relative size of the componentwise
differences. As Figure 8 convincingly demonstrates, differences
between the empirical Bayes estimates from one half-sample to another
are much smaller than those for the least squares estimates: In each
case, the empirical Bayes quadrilateral is entirely contained within
the least squares quadrilateral.

Figure 8 (panels 8a-d). Differences across half-samples in the estimates
of the four regression coefficients in a prediction equation by two
methods: Empirical Bayes and Least Squares. Note different scales on
abscissa and ordinate axes and across all four sample departments.
Panel 8d: Sample Social Sciences Department.

These properties are very important in the context of the admission
process. First, admission committees are rightly disturbed by
large year-to-year fluctuations in the prediction equations. Stability
in the prediction equations is not only plausible but also
facilitates the development of a consistent set of admission criteria.
Second, the prediction equations are fitted to data derived from the
most recent cohort to complete the first year of studies but are
applied to prospective candidates. The predicted FYAs for these
candidates may play a role in the admission decisions, and the
extent of that role will depend on how confident the committee is
that the predicted and actual FYAs will tend to be similar. One
measure of the similarity is the $R^2$ coefficient, which, in this
case, would be the proportion of variation in actual FYA explained
by the FYAs generated by the fitted model.
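
One common way to express the proportion of variation in actual FYA explained by a set of predictions is sketched below. It is offered as an assumption about the general form only: the predicted $R^2$ defined in Section 3 is not reproduced here and may differ in detail. The function name is hypothetical.

```python
def realized_r2(actual, predicted):
    """Proportion of variation in actual values explained by predictions.

    Computed as 1 - SSE/SST, where the predictions may come from a model
    fitted to other data (for example, a previous cohort).
    """
    mean_actual = sum(actual) / len(actual)
    sst = sum((a - mean_actual) ** 2 for a in actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - sse / sst
```

When the predictions come from a regression fitted to different data, this quantity can fall well below the $R^2$ of the original fit, and it becomes negative if the predictions do worse than the mean of the new cohort.
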

The simplest approach is to take the $R^2$ of the least squares
fit to the past cohort as an estimate of the $R^2$ that will be
realized when the predictions generated by that same fit are compared
to the actual FYAs of the new cohort. The cross-validations showed
that such estimates tend to be biased and considerably overstate the
$R^2$. On the other hand, the so-called predicted $R^2$ defined in Section 3,
when applied to empirical Bayes estimates, provided reasonable and
approximately unbiased estimates of the $R^2$ to be realized on new
data with predictions generated from a regression derived from other
data. Inasmuch as the predicted $R^2$ based on half-samples proved
so useful, we have adapted it to the full samples.

Figure 9 compares the distribution of predicted $R^2$ for EBFf
with the distribution of $R^2$ of the least squares fit for
departments in the five clusters. Note that the $R^2$ distributions for
empirical Bayes display considerably less dispersion. Our analysis
strongly suggests that these $R^2$ values are more reliable in a
prediction context than those of least squares. The differences by
cluster in the distributions deserve further study. It should be
emphasized that the cross-validation $R^2$ we discuss here is
somewhat different in nature from the $R^2$ typically produced in
validity studies. The latter is based on regressions fitted to the
data of the current year, while the former employs regressions
fitted to the previous year's data.

In more familiar terms, our results indicate that a typical
department employing a prediction equation estimated by the methods
described here would realize validities between 0.3 and 0.3 in
predicting the grades of those candidates for admission who will
actually attend and persist. Because restriction of range
corrections have not been implemented, the validities are rather lower than
they would be in unselected samples. Recent work on the validity of
the LSAT (Braun & Szatrowski, 1982) has demonstrated that prediction
equations estimated through empirical Bayes, combined with test score
information on applicants, can provide estimates of validity corrected
for selection. These procedures are perfectly practicable in the
context of the GRE Validity Study Service data, but have not been
carried out here. The inclusion of scores on the Subject Tests
among the predictors would also enhance the validity of the prediction
equation.

Figure 9: $R^2$ for Least Squares and Empirical Bayes

The employment of empirical Bayes methods also has facilitated
study of the question of whether constructing separate prediction
planes for particular groups of students leads to better predictions.
We investigated two different ways of classifying students: one by
age and sex and the other by race alone. In both cases, a single
prediction plane for all students performed essentially as well as
the set of separate prediction planes. Thus, within the constraints
set by the quantity and quality of the data, there is no evidence of
differential predictive validity for the common predictors of graduate
school performance.

Finally, a methodological foray into cluster analysis has led to
a new approach to the clustering of academic disciplines. Various
attempts to capitalize on these clusters, in conjunction with empirical
Bayes, to produce improved prediction schemes were unsuccessful.
Nonetheless, we believe the methods we have developed should prove useful.
In particular, these new clusters of disciplines may lead to some
insight into shifts in the flow of students into various areas of
graduate study.

What then are the implications of our work for the GRE Validity
Study Service, and what directions should future research pursue?

First, we believe that our finding of an apparently universal
structure underlying the prediction planes for graduate departments
is far-reaching. Not only is it an interesting result in a purely
theoretical sense, demanding some explanation, but also it holds the
promise of providing the Validity Study Service with an easily
implementable prediction system that can significantly widen the
applicability and reliability of the annual validity studies.
Almost all graduate departments can be provided with useful and
replicable results concerning the roles that various predictors,
singly or jointly, can play in the prediction of the future
performance of prospective students.

In this report, we have stressed methodological considerations
and general conclusions over results for specific departments. This
follows from the fact that we have only begun to explore the rich
family of models within the empirical Bayes framework. Further work
will undoubtedly uncover other models that share the conceptual basis
of those we have described here, but lead to different sets of
departmental coefficients that perform better in practice. The work we
have carried out can lead directly to a significant increase in the
number of departments participating in the Validity Study Service
as well as enhance the quality and breadth of information the validity
reports can provide the individual department.

A first priority of future research should be to continue the
development of practical empirical Bayes models and to test them in
the crucible of cross-validation against more classic systems involving
various levels of data pooling. Once an acceptable system has been
developed, a comprehensive examination of the resulting prediction
equations should be undertaken. This should include a study of the
incremental contributions to validity of each predictor, once the
other predictors are accounted for. At that point, a qualitative
analysis of the contributions of the various predictors by discipline
should be carried out so that departments that do not employ formal
prediction systems could still benefit from the results.

In addition, two other problems must be addressed. One concerns
the use of unique predictors by individual departments. The empirical
Bayes framework must be suitably expanded to include such predictors,
while maintaining the desired stability in the final prediction
equations. Second, the question of differential validity for various
subgroups of candidates must be tackled in a somewhat different
fashion. It is possible that a hybrid of various models can be
formulated to overcome the paucity of data on minorities.

The empirical Bayes framework is undoubtedly rich enough to
accommodate even this difficulty. Our work suggests that the GRE
Validity Study Service will benefit from incorporating empirical
Bayes ideas. Although the models we have experimented with have
quite reasonable properties, we are convinced that further research
will uncover still more powerful ways of looking at validity data.

Technical Appendix

The estimation problem following from the empirical Bayes model
may be described as follows. Suppose there are $m$ departments with $n_i$
students in department $i$. We assume that a linear model of the
following form holds:

(1)  $Y_i = X_i B_i + e_i \quad (i = 1, 2, \ldots, m)$

where

$Y_i$ is an $n_i \times 1$ vector of first-year averages,

$X_i$ is an $n_i \times P$ design matrix containing information on
$P$ test scores of the students,

$B_i$ is a $P \times 1$ vector of regression coefficients, and

$e_i$ is an $n_i \times 1$ vector of random errors.

$Y_i$ and $X_i$ are observed, while $B_i$ and $e_i$ are not. Interest centers
on obtaining estimates of $\{B_i\}$. Ordinarily, one assumes that the
components of $e_i$ are a random sample from a normal distribution, $N(0, \sigma_i^2)$.

The key assumption in our formulation of an empirical Bayes model
is that

(2)  $B = Z G + D$

where $B$ is an $m \times P$ matrix of regression coefficients where the ith row
of $B$ is denoted $B_i'$,

$Z$ is an $m \times k$ matrix of departmental covariates where the ith row
of $Z$ is denoted $Z_i'$ and contains information on $k$ variates,

$G$ is a $k \times P$ matrix of regression coefficients,

$D$ is an $m \times P$ matrix of random errors where the ith row of $D$
is denoted $D_i'$.

It is assumed that the $D_i'$ values are distributed according to a
multivariate normal distribution, $N(0, \Sigma^*)$.

Under the models (1) and (2), together with the accompanying
normality assumptions, the $\{B_i\}$ have a joint multivariate
normal distribution given the observed data and the parameters $G$ and $\Sigma^*$.
The empirical Bayes estimate of the $\{B_i\}$ is the vector of means of
this posterior distribution with $G$ and $\Sigma^*$ replaced by their maximum
likelihood estimates. These estimates are obtained through application
of the EM algorithm (Dempster, Laird, & Rubin, 1977). The algorithm
consists of a succession of two-step cycles: an E-step and an
M-step.

Beginning with initial estimates of $\sigma_i^2$, $G$, and $\Sigma^*$, we obtain
for the E-step:

$r_i = E(B_i \mid Y_i, X_i, \sigma_i^2, G, \Sigma^*) = (P^* + P_i)^{-1} (P^* G' Z_i + P_i \hat{B}_i)$

and

$S_i = E(B_i B_i' \mid Y_i, X_i, \sigma_i^2, G, \Sigma^*) = r_i r_i' + (P^* + P_i)^{-1}$,

where $P^* = \Sigma^{*-1}$, $P_i = \sigma_i^{-2} X_i' X_i$, and $\hat{B}_i$ is the least squares
estimate of $B_i$ based on data from department $i$ only. Thus, the
current estimate of the posterior mean of $B_i$ is a precision-weighted
combination of the least squares estimate and the appropriate point
on the surface defined by equation (2). The quantity $S_i$ is required
for the M-step.
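
The precision-weighted combination is easiest to see in the scalar case. The sketch below evaluates the E-step quantities for a single department when there is one regression coefficient and one departmental covariate (P = 1, k = 1), so that every matrix reduces to a number. The function and argument names are hypothetical, and the scalar restriction is an assumption made for illustration only.

```python
def e_step_scalar(b_hat, x_sum_sq, sigma2, z, g, tau2):
    """E-step for one department when P = 1 and k = 1.

    b_hat    : least squares slope from the department's own data
    x_sum_sq : sum of squared predictor values (the scalar X_i'X_i)
    sigma2   : current residual variance for the department
    z, g     : departmental covariate and its second-level coefficient
    tau2     : current second-level variance (the scalar Sigma*)
    Returns (r, s): posterior mean of B_i and posterior E[B_i^2].
    """
    p_star = 1.0 / tau2          # precision of the second-level model
    p_i = x_sum_sq / sigma2      # precision of the least squares estimate
    r = (p_star * g * z + p_i * b_hat) / (p_star + p_i)
    s = r * r + 1.0 / (p_star + p_i)
    return r, s
```

A department with little data has a small `p_i`, so its estimate is pulled strongly toward the second-level prediction `g * z`; a department with abundant data keeps an estimate close to its own least squares value.
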

In the M-step, we obtain updated estimates of $\sigma_i^2$, $G$, and $\Sigma^*$ based
on $\{r_i\}$ and $\{S_i\}$. In effect, we regress the $r_i$ (in place of the
unobservable $B_i$) on $Z_i$. Thus

$G = (Z'Z)^{-1} Z'R$,

$\Sigma^* = m^{-1} [R'R - G'(Z'Z)G]$,

and

$\sigma_i^2 = n_i^{-1} [Y_i'Y_i - 2 r_i' X_i'Y_i + \sum_j \sum_k s_{jk}^{(i)} w_{jk}^{(i)}]$,

where $S_i = (s_{jk}^{(i)})$, $X_i'X_i = (w_{jk}^{(i)})$, and $R$ is a matrix whose ith
row is $r_i'$.

The E-step is then reentered with the updated estimates, and new
conditional expectations are calculated. The process is continued
until the estimates of $G$, $\Sigma^*$, and $\{\sigma_i^2\}$ converge to the maximum
likelihood estimates. The corresponding $\{r_i\}$ values are the desired
empirical Bayes estimates of $\{B_i\}$.

It should be noted that we do not attempt to obtain simultaneously
empirical Bayes estimates of $\{B_i\}$ and $\{\sigma_i^2\}$. This more complicated
problem cannot be solved directly, and extensive numerical calculations
are required to obtain even posterior modes as estimates of the
parameters. The quality of such estimates is not clear. We prefer,
therefore, to uncouple the estimation problems for the residual
variances and the regression coefficients.






